Minimizing staleness in real-time data warehouses

ABSTRACT

Data tables in data warehouses are updated to minimize staleness and stretch of the data tables. New data is received from external sources and, in response, update requests are generated. Accumulated update requests may be batched. Data tables may be weighted to affect the order in which update requests are serviced.

BACKGROUND

1. Field of the Disclosure

The present disclosure relates to updating data warehouses.

2. Description of the Related Art

Data warehouses store data tables that contain data received from external sources. As an example, the data may relate to network performance parameters.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 illustrates a data warehouse server that receives data and updates data tables in accordance with disclosed embodiments;

FIG. 2 depicts elements of a method for updating data tables in a data warehouse in accordance with disclosed embodiments;

FIG. 3 illustrates selected elements of a data processing system provisioned as a data warehouse server for updating data tables in accordance with disclosed embodiments;

FIG. 4(a) is a graph of staleness values related to data tables in a data warehouse;

FIG. 4(b) is a further graph of staleness values related to data tables in a data warehouse; and

FIG. 5 illustrates algorithms related to minimizing staleness values for data tables.

DESCRIPTION OF EXEMPLARY EMBODIMENTS

In a particular embodiment, a disclosed method updates data tables stored in a data warehouse. The data warehouse may be a real-time data warehouse. The method includes receiving data for updating the data tables, generating update requests responsive to the receiving, calculating a staleness for a portion of the data tables, and scheduling data table updates on a plurality of processors based at least in part on the calculated staleness and the update requests. The method further includes transforming the data tables based on the scheduled data table updates to include a portion of the received data.

Generally, the staleness is indicative of an amount of time elapsed since the previous update of the data tables. Update requests may be assumed non-preemptible and accumulated update requests are batched together. The method may further include determining a stretch value for the update request, wherein the stretch value is indicative of the maximum ratio between the duration of time an update waits until it is finished being processed and the length of the update.

Further embodiments relate to a server for managing a data warehouse. The server includes a memory for storing the data warehouse, which includes a plurality of data tables. An interface receives data for updating the data tables, and a processor calculates a staleness for a portion of the data tables responsive to receiving data on the interface. Further instructions are for weighting a portion of the calculated stalenesses and scheduling data table updates for completion by a plurality of processors based at least in part on the weighted stalenesses. Accumulated update requests from the generated update requests are batched together.

To provide further understanding of disclosed systems, data warehouses and aspects related to updating data tables are discussed. Data warehouses integrate information from multiple operational databases to enable complex business analyses. In traditional applications, warehouses are updated periodically (e.g., every night) and data analysis is done off-line. In contrast, real-time warehouses continually load incoming data feeds for applications that perform time-critical analyses. For instance, a large Internet Service Provider (ISP) may collect streams of network configuration, performance, and alarm data. New data must be loaded in a timely manner and correlated against historical data to quickly identify network anomalies, denial-of-service attacks, and inconsistencies among protocol layers. Similarly, on-line stock trading applications may discover profit opportunities by comparing recent transactions against historical trends. Finally, banks may be interested in analyzing streams of credit card transactions in real-time to protect customers against identity theft.

The effectiveness of a real-time warehouse depends on its ability to make newly arrived data available for querying. Disclosed embodiments relate to algorithms for scheduling updates in a real-time data warehouse in a way that 1) minimizes data staleness and 2) under certain conditions, ensures that the “stretch” (delay) of each update task is bounded. In some cases, disclosed systems seek to schedule the updating of data tables to occur within a constant factor of an optimal solution for minimizing staleness and stretch.

Data warehouses maintain sets of data tables that may receive updates in an online fashion. The number of external sources may be large. The arrival of a new set of data records may generate an update request to append the new data to the corresponding table(s). If multiple update requests have accumulated for a given table, the update requests are batched together before being loaded. Update requests may be long-running and are typically non-preemptible; that is, it may be difficult to suspend a data load, especially if it involves a complex extract-transform-load process. There may be a number p of processors available for performing update requests. At any time t, if a table has been updated with data up to time r (i.e., the most recently loaded update request arrived at time r), its staleness is t−r.

Given the above constraints, some embodied systems solve the problem of non-preemptively scheduling the update requests on p processors in a way that minimizes the total staleness of all the tables over time. If some tables are more important than others, scheduling may occur to prioritize updates to important tables and thereby minimize “priority-weighted” staleness.

Some disclosed systems use scheduling algorithms to minimize staleness and weighted staleness of the data in a real-time warehouse, and to bound the maximum stretch that any individual update may experience.

On-line non-preemptive algorithms that are not voluntarily idle can achieve an almost optimal bound on total staleness. Total weighted staleness may be bounded in a semi-offline model if tables can be clustered into a “small” number of groups such that the update frequencies within each group vary by at most a constant factor.

In the following description, details are set forth by way of example to facilitate discussion of the disclosed subject matter. It should be apparent to a person of ordinary skill in the art, however, that the disclosed embodiments are exemplary and not exhaustive of all possible embodiments. Throughout this disclosure, a hyphenated form of a reference numeral refers to a specific instance of an element and the un-hyphenated form of the reference numeral refers to the element generically or collectively. Thus, for example, widget 12-1 refers to an instance of a widget class, which may be referred to collectively as widgets 12 and any one of which may be referred to generically as a widget 12.

FIG. 1 illustrates system 100 that includes data warehouse server 118, which as shown maintains a real-time data warehouse. Data warehouse server 118 receives data (over interface 120 through network 106 from server 127) from multiple external sources including data processing system 125, storage server 121, and mail server 123. The received data is stored as new data 114 by processor 110, which also generates update requests 116. Data warehouse server 118 accesses a computer readable medium (not depicted) embedded with computer instructions for managing data tables 108. A particular embodiment includes instructions that maintain data tables 108 as a data warehouse and receive requests with new data 114 for updating a portion of data tables 108. Further instructions generate update requests 116 that correspond to the received data. Staleness values are calculated for individual data tables of data tables 108. For example, staleness values can be calculated for data tables 108-2 and 108-1. The calculated stalenesses can be ranked or compared to a threshold, as examples. Further instructions schedule updating of data tables 108 with updates 102. The scheduling of updating data tables 108 may be based on the ranked stalenesses.

With the addition of updates 102, portions of data tables 108 are transformed based on the scheduling and the update requests 116. Update requests 116, in some embodiments, are for appending new data 114 to data tables 108 as updates 102.

In some embodiments, a stretch value for one or more of data tables 108 is determined and ranking data tables 108 is based at least in part on the calculated stretch. In some embodiments, the calculated stretch value is the maximum ratio between the duration of time an update (e.g., an update based on update request 116-3) waits until it is finished being processed and the length of the update. Data warehouse server 118 may batch together accumulated portions of the generated update requests 116. New data 114 is distributed to data tables 108 as updates 102 by the scheduling processors 109 to minimize staleness. As shown, there are a number p of processors, which is indicated by scheduling processor 109-p.

In some embodiments, update requests 116 are assumed non-preemptible. A portion of data tables 108 may be analyzed for a staleness value indicative of an amount of time elapsed since a previous update of the portion of data tables. A first portion of data tables 108 (e.g., data table 108-1) may be weighted higher than a second portion of the data tables (e.g., data table 108-2), and the scheduling processors 109 can schedule updates to these data tables responsive to the weighting results.

Illustrated in FIG. 2 is a method 200 for updating data tables (e.g., data tables 108 in FIG. 1). Data is received (block 201) for updating the data tables. Responsive to receiving (block 201) the data, update requests are generated (block 203). The generated update requests may be non-preemptible. A staleness value is calculated (block 205) for a portion of the data tables. Optionally, the staleness values are weighted (block 207) and a stretch value is calculated (block 209) for data tables. Weighting may occur by multiplying a first data table staleness by a first weight and multiplying a second data table staleness by a second weight. The stretch value indicates the maximum ratio between the duration of time an update waits until it is finished being processed and the length of the update. A determination is made (block 211) whether there are accumulated update requests. If there are accumulated update requests, the accumulated update requests are batched (block 213). Data table updates are scheduled (block 215) for the update requests (as shown, whether batched or not) based at least in part on the calculated stalenesses. The updates may be scheduled to occur at variable intervals.
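The weighting and scheduling steps of method 200 can be illustrated with a minimal sketch. The function name `pick_next_table`, the dictionary layout, and the rule of always choosing the table with the largest weighted staleness are assumptions made for illustration only; the disclosure states only that scheduling is based at least in part on the (optionally weighted) stalenesses.

```python
def pick_next_table(now, freshness, weights, has_pending):
    """Return the table with the largest weighted staleness, or None.

    freshness[i]   -- arrival time of the newest update already loaded into table i
    weights[i]     -- hypothetical priority weight of table i (defaults to 1.0)
    has_pending[i] -- True if table i has at least one accumulated update request
    """
    candidates = [i for i in freshness if has_pending.get(i, False)]
    if not candidates:
        return None  # nothing to schedule
    # Weighted staleness = weight * (time elapsed since the loaded data was fresh).
    return max(candidates, key=lambda i: weights.get(i, 1.0) * (now - freshness[i]))


freshness = {"a": 8.0, "b": 4.0}
pending = {"a": True, "b": True}
unweighted_choice = pick_next_table(10.0, freshness, {}, pending)
weighted_choice = pick_next_table(10.0, freshness, {"a": 5.0, "b": 1.0}, pending)
```

Without weights the staler table "b" is chosen; with table "a" weighted five times higher, its weighted staleness (10) exceeds that of "b" (6) and it is chosen instead.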

Referring now to FIG. 3, data processing system 321 is provisioned as a server for managing a data warehouse. As shown, the server includes computer readable media 311 for storing the data warehouse 301, which has a plurality of data tables 313. The server further has an input/output interface 315 for receiving data for updating the data tables and a processor 317 that is enabled by computer readable instructions stored in computer readable media 311. Processor 317 is coupled via shared bus 323 to memory 319, input/output interface 315, and computer readable media 311. It will be noted that memory 319 is a form of computer readable media 311. In operation, responsive to input/output interface 315 receiving new data, staleness calculation module 303 calculates a staleness for a portion of data tables 313. Staleness weighting module 304 optionally weights a portion of the calculated stalenesses. Staleness ranking module 307 ranks the stalenesses. Update scheduling module 309 schedules data table updates for completion by a plurality of processors based at least in part on the weighted stalenesses. Update request generation module 302 generates update requests for newly received data. In some embodiments, update requests may be batched together by batching update requests module 305. Stretch calculations may be performed for a portion of the data tables 313 by stretch calculation module 306. Accordingly, update scheduling module 309 may schedule data table updates based at least in part on the stretch value. The calculated stretch value, in some embodiments, represents the maximum ratio between the duration of time an update waits until it is finished being processed and the length of the update.

In disclosed methods including method 200, updating the data tables may include appending new data to corresponding data tables. The staleness can be indicative of an amount of time elapsed since a previous update of the data tables. Some embodied methods include scheduling data table updates on p processors based at least in part on the calculated staleness and the update requests. Updating the data tables with the new data transforms the data tables based on the scheduled data table updates to include a portion of received data. Disclosed methods may include weighting a first portion of the data tables higher than a second portion of the data tables, wherein the scheduling is at least in part based upon the weighting.

Further embodiments relate to computer instructions stored on a computer readable medium for managing a plurality of data tables in a data warehouse. The computer instructions enable a processor to maintain a plurality of data tables in the data warehouse, receive data for updating a portion of the plurality of data tables, and generate update requests corresponding to the received data. For individual data tables, a staleness for the data table is calculated and ranked. Updating the data tables is scheduled based on the ranked stalenesses, and the data table is transformed (e.g., appended with new data) based on the updating. A stretch value may be calculated for a portion of individual data tables. The stretch value is indicative of the maximum ratio between the duration of time an update waits until it is finished being processed and the length of the update. Further instructions allow for accumulated update requests to be batched and processed together. A portion of update requests may be non-preemptible.

Calculations and other aspects of updating data warehouses are touched on for a better understanding of disclosed systems. Suppose a data warehouse consists of t tables and p processors, and that p≦t. Each table i receives update requests at times r_(i1)<r_(i2)< . . . <r_(i,ki), where r_(i0)=0<r_(i1). An update request at time r_(ij) contains data generated between times r_(i,j−1) and r_(ij). The length of this update is defined as r_(ij)−r_(i,j−1). Associated with each table i is a real α_(i)≦1 such that processing an update of length L takes time at most α_(i)L. The constants α_(i) need not be the same. For example, some data feeds may produce more data records per unit time, meaning that updates will take longer to load. At any point in time, an idle processor may decide which table it wants to update, provided that at least one update request for this table is pending. Suppose that at time τ table i is picked, and that the most recently loaded update arrived at time r_(ij). The processor would need to non-preemptively perform all the update requests for table i that have arrived after time r_(ij) and up to time τ, and there could be one or more pending requests. These pending update requests may be referred to as a “batch” with its length defined as the sum of the lengths of the pending update requests. In practice, batching may be more efficient than separate execution of each pending update. When the entire batch has been loaded, the processor can choose the next table to update.
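The batching rule just described can be sketched as follows. The function name `pending_batch` and its argument names are hypothetical. Because each update's length is the gap to the previous arrival, the batch length (the sum of the pending lengths) telescopes to the latest pending arrival time minus the arrival time of the last loaded update.

```python
from bisect import bisect_right


def pending_batch(arrivals, last_loaded, now, alpha):
    """Collect the batch for one table at time `now`.

    arrivals    -- sorted arrival times r_i1 < r_i2 < ... of the table's updates
    last_loaded -- arrival time r_ij of the most recently loaded update (0 if none)
    alpha       -- the table's constant alpha_i (processing takes at most alpha * length)
    Returns (pending arrival times, batch length, upper bound on processing time),
    or None when no update request is pending.
    """
    lo = bisect_right(arrivals, last_loaded)  # first arrival after the loaded one
    hi = bisect_right(arrivals, now)          # arrivals up to and including `now`
    pending = arrivals[lo:hi]
    if not pending:
        return None
    length = pending[-1] - last_loaded        # telescoped sum of update lengths
    return pending, length, alpha * length


batch = pending_batch([1, 3, 6, 10], last_loaded=1, now=7, alpha=0.5)
```

Here the updates that arrived at times 3 and 6 form one batch of length 5, which takes at most 2.5 time units to load.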

At any time τ, the staleness of table i is defined as τ−r, where r is the arrival time of the most recent update request r_(ij) that has been loaded. FIG. 4(a) illustrates the behavior of the staleness function of table i over time. The total staleness for this table is simply the area under the staleness curve. Suppose that table i is initialized at time r_(i0)=0. Let rs_(ij) and rf_(ij) denote the times that update r_(ij) starts and finishes processing, respectively. As can be seen, staleness accrues linearly until the first update is loaded at time rf_(i1). At this time, staleness does not drop to zero; instead, it drops to rf_(i1)−r_(i1). In contrast, FIG. 4(b) shows the staleness of table i assuming that the first two updates were batched. In this case, rs_(i2)=rs_(i1) and rf_(i2)=rf_(i1); conceptually, both update requests start and finish execution at the same times. Observe that staleness accrues linearly until the entire batch has been loaded. Clearly, total staleness is higher in FIG. 4(b) because the first update has been delayed.
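The area-under-the-curve definition can be made concrete with a short sketch. The function name `total_staleness` and the sample arrival and finish times are hypothetical and are not taken from FIG. 4; the code only assumes the definition above, namely that staleness at time τ is τ minus the arrival time of the newest loaded update, with the table initialized at time 0.

```python
def total_staleness(arrivals, finishes, horizon):
    """Area under one table's staleness curve on [0, horizon].

    arrivals[j] is r_ij and finishes[j] is rf_ij for the j-th update;
    updates loaded in the same batch share one finish time.
    """
    area = 0.0
    t_prev = 0.0    # left end of the current linear segment
    freshest = 0.0  # arrival time of the newest loaded update (r_i0 = 0)
    for finish, arrival in sorted(zip(finishes, arrivals)):
        if finish > horizon:
            break
        # Staleness rises with slope 1 from (t_prev - freshest) to (finish - freshest).
        area += (finish - t_prev) * ((t_prev - freshest) + (finish - freshest)) / 2.0
        t_prev, freshest = finish, max(freshest, arrival)
    # Final segment from the last load up to the horizon.
    area += (horizon - t_prev) * ((t_prev - freshest) + (horizon - freshest)) / 2.0
    return area


separate = total_staleness([2, 4], [3, 5], horizon=6)  # each update loaded on its own
batched = total_staleness([2, 4], [5, 5], horizon=6)   # both updates loaded as one batch
```

With these sample times the separately loaded schedule accrues a total staleness of 10, while the batched schedule accrues 14, because the first update is held back until the batch finishes.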

The flow time of a task can be defined as the difference between its completion time and release time, and the stretch of a task is the ratio of its flow time to its processing time. Stretch measures the delay of the task relative to its processing time. These definitions may be slightly modified for various update tasks disclosed herein. For example, the flow time of an update request arriving at time r_(ij) may be redefined as the “standard” flow time plus the length of the update, i.e., rf_(ij)−r_(i,j−1). Furthermore, the stretch of said update request may be redefined analogously, with the length of the update added to both the flow time and the processing time, i.e.:

$\frac{{rf}_{ij} - r_{i,{j - 1}}}{\left( {{rf}_{ij} - {rs}_{ij}} \right) + \left( {r_{ij} - r_{i,{j - 1}}} \right)}.$

Given the above definitions, the total staleness of table i in some time interval is upper-bounded by the sum of squares of the flow times (using the modified definition of flow time) of update requests in that interval. There are no known competitive algorithms for minimizing the L₂ norm of “standard” flow time; however, any non-preemptive algorithm that is not voluntarily idle may nearly achieve the optimistic lower bound on total staleness.
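As an illustration of the sum-of-squares bound, the following sketch computes modified flow times for one table, taking the modified flow time of an update to be its "standard" flow time (completion minus arrival) plus its length, which equals its finish time minus the previous update's arrival time. The function names and sample times are hypothetical.

```python
def modified_flow_times(arrivals, finishes):
    """Modified flow time of each update: finish time minus the previous arrival.

    arrivals[j] is r_ij, finishes[j] is rf_ij, and r_i0 is taken to be 0.
    """
    flows = []
    prev_arrival = 0.0
    for arrival, finish in zip(arrivals, finishes):
        flows.append(finish - prev_arrival)
        prev_arrival = arrival
    return flows


def sum_of_squared_flows(arrivals, finishes):
    """Sum of squares of the modified flow times, the quantity bounding staleness."""
    return sum(f * f for f in modified_flow_times(arrivals, finishes))


flows = modified_flow_times([2, 4], [3, 5])
bound = sum_of_squared_flows([2, 4], [3, 5])
```

For updates arriving at times 2 and 4 and finishing at times 3 and 5, both modified flow times are 3, so the sum of squares is 18.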

There are competitive algorithms for minimizing the L₂ norm of flow times of all the update requests, using the modified definition of flow times defined previously. An algorithm is so-called “opportunistic” if it leaves no processor idle while a performable batch exists.

For any fixed β and δ such that 0<β,δ<1, define C_(β,δ)=√δ·(1−β)/√6. Given a number p of processors and a number t of tables, define α:=(p/t)·min{C_(β,δ),¼}. Then, provided that each α_(i)<α, the competitive ratio of any opportunistic algorithm is at most

$\left( {1 + \alpha} \right)^{2}\frac{1}{\beta^{4}}{\frac{1}{1 - \delta}.}$

Any choice of constant parameters β and δ results in a constant competitive ratio. Note that as β→1 and δ→0, α approaches 0 and hence the competitive ratio approaches 1.
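To give a feel for the magnitudes involved, the following sketch evaluates the constants for sample parameter values. It reads C_(β,δ) as √δ·(1−β)/√6, which is the reading consistent with the definition of γ in (14); the function names and sample values are hypothetical.

```python
import math


def c_beta_delta(beta, delta):
    """C_{beta,delta} = sqrt(delta) * (1 - beta) / sqrt(6)."""
    return math.sqrt(delta) * (1.0 - beta) / math.sqrt(6.0)


def alpha_threshold(p, t, beta, delta):
    """alpha := (p / t) * min{C_{beta,delta}, 1/4}."""
    return (p / t) * min(c_beta_delta(beta, delta), 0.25)


def competitive_ratio_bound(alpha, beta, delta):
    """(1 + alpha)^2 * (1 / beta^4) * (1 / (1 - delta))."""
    return (1.0 + alpha) ** 2 / (beta ** 4 * (1.0 - delta))


alpha = alpha_threshold(p=2, t=4, beta=0.5, delta=0.5)
ratio = competitive_ratio_bound(alpha, beta=0.5, delta=0.5)
tight = competitive_ratio_bound(alpha_threshold(2, 4, 0.99, 0.01), 0.99, 0.01)
```

With p=2, t=4 and β=δ=0.5, the threshold α is roughly 0.072 and the bound on the competitive ratio is roughly 36.8; pushing β toward 1 and δ toward 0 brings the bound close to 1 at the cost of a much smaller admissible α.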

The penalty (i.e., sum of squares of flow times) of a given algorithm may be compared to a simple lower bound. Let A be the set of all updates. Independent of their batching and scheduling, each update i of length a_(i) needs to pay a minimum penalty of a_(i)². If a set of updates are batched together, it may be necessary to pay no less than the sum of the squares of the updates' lengths. This follows from the convexity of the square function. Therefore,

$\begin{matrix}{{LOW}:={{\sum\limits_{i \in A}a_{i}^{2}} \leq {{OPT}.}}} & (1)\end{matrix}$

Looking at the cost a particular solution is paying, let B be the set of batches in the solution. For a batch B_(i)∈B, let J_(i) be the first update in the batch, having length c_(i). This batch is not applied until, for example, d_(i) time units have passed since the release of J_(i). The interval of size d_(i) starting from the release of update J_(i) is called the “delay interval” of the batch B_(i), and d_(i) is called the “delay” for this batch. As for the length of the batch, denoted by b_(i), the following applies:

$\begin{matrix}{c_{i} \leq b_{i} \leq {c_{i} + d_{i}}.} & (2)\end{matrix}$

For the penalty of this batch, denoted by ρ_(i), the following applies:

$\begin{aligned}\rho_{i} &= \left[ \left( c_{i} + d_{i} \right) + \alpha b_{i} \right]^{2} && \text{by the definition of penalty} && (3) \\ &\leq \left( 1 + \alpha \right)^{2}\left( c_{i} + d_{i} \right)^{2} && \text{by (2).} && (4)\end{aligned}$

Considering the case of one processor and two tables, if the updates of one table receive a large delay, it may indicate that the processor was busy applying updates from the other table (because it may be desirable for an applied algorithm to avoid remaining idle if it can perform something). Therefore, these jobs which are blocking the updates from the first table can pay (using their own sum-of-squares budget originally coming from the lower bound on OPT) for the penalty of a disclosed solution. It may be problematic if updates from the other tables (which are responsible for the payment) might be very small pieces whose sum of squares is not large enough to make up for the delay experienced (consider that the sum of their unsquared values is comparable to the delay amount). There may only be two batches, one of which is preferably large, occurring while a job of table 1 is being delayed. Another caveat is that the budget, i.e., the lower bound (LOW), is Σ_(i∈A)a_(i)², rather than Σ_(i∈B)b_(i)², which may be much larger. If each of these batches is not much larger than its first piece (i.e., b_(i)=Θ(c_(i))), then by losing a constant factor, the length of the batch can be ignored, and job sizes may be adjusted. Otherwise, this batch has a large delay and some other batches should be responsible for this large delay. It may be preferable to have those other batches pay for the current batch's delay. This indirection might have more than one level, but may not be circular.

Each job i∈A has a budget of a_(i)² units. A batch B_(i)∈B, by (4), may need to secure a budget which is proportional to (c_(i)+d_(i))². These conditions may be relaxed slightly in the following: A so-called “charging scheme” specifies what fraction of its budget each job pays to a certain batch. Let a batch B_(i) be “tardy” if c_(i)<β(c_(i)+d_(i)) (where β is the parameter fixed above); otherwise it is “punctual.” Let us denote these sets by B_(t) and B_(p) respectively. More formally, a charging scheme is a matrix (v_(ij)) of nonnegative values, where v_(ij) shows the extent of dependence of batch i on the budget available to batch j, with the following two properties.
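The tardy/punctual distinction and the penalty bound in (3) and (4) can be checked numerically with a small sketch. The function names and sample values are hypothetical; the formulas are those stated above.

```python
def is_tardy(c, d, beta):
    """A batch is tardy if c < beta * (c + d); otherwise it is punctual."""
    return c < beta * (c + d)


def batch_penalty(c, d, b, alpha):
    """Penalty of a batch per (3): [(c + d) + alpha * b]^2."""
    return ((c + d) + alpha * b) ** 2


def penalty_upper_bound(c, d, alpha):
    """Upper bound per (4): (1 + alpha)^2 * (c + d)^2, valid when b <= c + d."""
    return (1.0 + alpha) ** 2 * (c + d) ** 2


# First update of length 2, delay 3, batch length 4 (so c <= b <= c + d holds).
rho = batch_penalty(c=2.0, d=3.0, b=4.0, alpha=0.1)
rho_bound = penalty_upper_bound(c=2.0, d=3.0, alpha=0.1)
```

In this example the penalty is about 29.16 and the bound about 30.25. With β=0.5, a batch with c=1 and d=9 is tardy, while one with c=4 and d=2 is punctual.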

1. For any batch B_(i)∈B,

$\begin{matrix}{{\left( {c_{i} + d_{i}} \right)^{2} \leq {\sum\limits_{j \in B_{p}}{v_{ij}b_{j}^{2}}}},{and}} & (5)\end{matrix}$

2. There exists a constant λ>0 such that, for any punctual batch B_(j),

$\begin{matrix}{{\sum\limits_{i \in B}v_{ij}} \leq {\lambda.}} & (6)\end{matrix}$

The existence of a charging scheme with parameters β and λ gives a competitive ratio of at most

$\left( {1 + \alpha} \right)^{2}\frac{1}{\beta^{2}}\lambda$

for an opportunistic algorithm.

This follows because, for each batch B_(i):

$\begin{aligned}\rho_{i} &\leq \left( 1 + \alpha \right)^{2}\left( c_{i} + d_{i} \right)^{2} && \text{by (4)} && (7) \\ &\leq \left( 1 + \alpha \right)^{2}\sum\limits_{j \in B_{p}} v_{ij}b_{j}^{2} && \text{by (5)} && (8) \\ &\leq \left( 1 + \alpha \right)^{2}\sum\limits_{j \in B_{p}} v_{ij}\frac{1}{\beta^{2}}c_{j}^{2} && \text{by (2) and the definition of punctuality.} && (9)\end{aligned}$

Hence, the total cost of a solution satisfies

$\begin{aligned}\sum\limits_{i \in B}\rho_{i} &\leq \left( 1 + \alpha \right)^{2}\frac{1}{\beta^{2}}\sum\limits_{j \in B_{p}} c_{j}^{2}\sum\limits_{i \in B} v_{ij} && \text{by (9)} && (10) \\ &\leq \left( 1 + \alpha \right)^{2}\frac{1}{\beta^{2}}\lambda\sum\limits_{j \in B_{p}} c_{j}^{2} && \text{by (6)} && (11) \\ &\leq \left( 1 + \alpha \right)^{2}\frac{1}{\beta^{2}}\lambda \cdot {LOW} && \text{since each } c_{j} \text{ is the length of a distinct update in } A && (12) \\ &\leq \left( 1 + \alpha \right)^{2}\frac{1}{\beta^{2}}\lambda \cdot {OPT} && \text{by (1).} && (13)\end{aligned}$

The “execution interval of a batch B” may be defined as the time interval during which the batch B is being processed. Accordingly, its length is at most α_(i) times the length of B, if the batch belongs to table i.

Batch B blocks batch B′ if B's execution interval has intersection of positive length with the delay interval of B′. Note that many batches can block a given batch B′. A charging scheme with the desired properties may be introduced. This is done by defining how v_(ij) values are computed. If a batch B_(i) is punctual, this may be relatively straightforward: all v_(ij) values are zero except for v_(ii)=1/β². Take a tardy batch B_(i). In this case d_(i) is large compared to c_(i). During the interval of length d_(i), during which J_(i) is waiting (i.e., the delay interval of batch B_(i)), all p processors should be busy. Let [r, r′] denote this interval. The total time is pd_(i). A relaxed version of this bound may be used to draw the conclusion (e.g., established by equation (6)).

A weighted directed graph with one node for each batch may be built. Punctual batches may be denoted as sinks (i.e., denoted as having no outarcs). Any tardy batch has arcs to all the batches blocking it, and there is at least one, since it has positive d_(i). Even though punctual batches may be blocked by other batches, they have no outarcs.

The result is a so-called “directed acyclic graph” (DAG), because along any directed path in the graph, the starting (execution) times of batches are decreasing. The weight w_(e) on any such arc e is the fraction, between 0 and 1, of the blocking batch which is inside the delay interval [r, r′] of the blocked batch. A parameter γ is also defined:

$\begin{matrix}{\gamma:={\frac{6t\;\alpha^{2}}{p^{2}}{\left( \frac{1}{1 - \beta} \right)^{2}.}}} & (14)\end{matrix}$

Then, for any two batches i and j, v_(ij) is defined as

$\begin{matrix}{{v_{ij} = {\frac{1}{\beta^{2}}{\sum\limits_{p \in P_{ij}}{\prod\limits_{e \in p}\left( {\gamma\; w_{e}^{2}} \right)}}}},} & (15)\end{matrix}$

where P_(ij) denotes the set of directed paths from i to j. The dependence along any path is the square of the product of weights on the path multiplied by γ to the power of the length of the path. This definition includes as a special case the definition of the v_(ij)'s for punctual batches i, since there is a path of length zero between any batch i and itself (giving v_(ii)=1/β²) and no path from batch i to batch j for any j≠i (giving v_(ij)=0 if j≠i).
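The path-sum definition in (15) can be evaluated on a small DAG with a memoized recursion over outgoing arcs. This is a minimal sketch: the function name `make_charging_scheme`, the adjacency-list layout, and the three-node example graph are assumptions for illustration.

```python
from functools import lru_cache


def make_charging_scheme(arcs, beta, gamma):
    """Return a function v(i, j) evaluating (15) on a DAG.

    arcs maps each tardy batch to a list of (blocking batch, weight w_e) pairs;
    punctual batches are sinks and have no entry (no outgoing arcs).
    """

    @lru_cache(maxsize=None)
    def path_sum(i, j):
        # The zero-length path from a node to itself contributes an empty product, 1.
        total = 1.0 if i == j else 0.0
        for child, w in arcs.get(i, ()):
            # Factor the first arc (gamma * w^2) out of every path through `child`.
            total += gamma * w * w * path_sum(child, j)
        return total

    return lambda i, j: path_sum(i, j) / beta ** 2


# Batch 0 is blocked by batches 1 and 2; batch 1 is blocked by batch 2 (a sink).
arcs = {0: [(1, 0.5), (2, 1.0)], 1: [(2, 0.5)]}
v = make_charging_scheme(arcs, beta=0.5, gamma=0.1)
v22, v12, v02 = v(2, 2), v(1, 2), v(0, 2)
```

In this example v(2,2)=1/β²=4, v(1,2)=0.1, and v(0,2)=0.4025, the last value summing the direct path 0→2 and the two-arc path 0→1→2. The same value is obtained from γ·w²(0,1)·v(1,2)+γ·w²(0,2)·v(2,2), which is the first-arc factoring used later in the induction.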

Such a charging scheme may satisfy the desired properties as shown: The cost paid for each batch should be accounted for using the budget it secures, as required above in (5).

For any batch B_(i)∈B,

${\left( {c_{i} + d_{i}} \right)^{2} \leq {\sum\limits_{j \in B_{p}}{v_{ij}b_{j}^{2}}}}.$

If B₁, . . . , B_(k) are the children of B₀, having arc weights w_(01), . . . , w_(0k), respectively, in a run of any opportunistic algorithm, then

${\sum\limits_{j^{\prime} = 1}^{k}\left( {w_{0j^{\prime}}b_{j^{\prime}}} \right)^{2}} \geq {\frac{p^{2}d_{0}^{2}}{6t\;\alpha^{2}}.}$

By the definition of the w_(e)'s, the construction of the graph, the fact that B₁, B₂, . . . , B_(k) are all the batches blocking B₀, and the fact that the k blocking batches are run on p processors in a delay interval of length d₀ (so that their actual lengths must sum to at least 1/α times as much), it can be expressed that:

$\begin{matrix}{{\sum\limits_{j^{\prime} = 1}^{k}{w_{0j^{\prime}}b_{j^{\prime}}}} \geq {{pd}_{0} \cdot {1/{\alpha.}}}} & (16)\end{matrix}$

All but at most 3t of the batches may be removed, such that the sum of sizes of the remaining batches is at least ¾ of the original sum. Let [r, r′] be the delay interval corresponding to batch B₀. There may be one batch per processor whose processing starts before r and does not finish until after r. At most p such batches may be kept, and in addition, the first at-most-two batches for each table that intersect with this interval may be kept; since p≦t, at most 3t batches are kept. The contribution of the other batches, however many they might be, is small. Consider the third (and higher) batches performed in this interval corresponding to a particular table. Their original tasks start no earlier than r and their release times do not exceed r′. The former is true, since otherwise, such pieces would be part of the first or second batch of this particular table; call them B₁ and B₂. To see this, suppose there exists an update J that starts before r and is not included in B₁ or B₂. As it is not included in B₁, it must have been released after the start of B₁. The batch B₂ cannot include any update released before r. So if it does not contain J, it should be empty, which is a contradiction. Hence, for each table, the total length of these batches (third and later) is no more than d₀, as they only include jobs whose start and end times are inside the delay interval [r, r′]. Summing over the t tables gives a total length of at most td₀, and

$\begin{matrix}{{td}_{0} \leq {{pd}_{0}/\left( {4\alpha} \right)}\text{, by definition of }\alpha.} & (17)\end{matrix}$

In conjunction with (16), the total (unsquared) length of the remaining at-most-3t batches is at least pd₀/α−(¼)pd₀/α=(¾)pd₀/α. Considering that generally:

$\begin{matrix}{{\sum\limits_{i = 1}^{N}x_{i}^{2}} \geq \frac{\left( {\sum\limits_{i = 1}^{N}x_{i}} \right)^{2}}{N}} & (18)\end{matrix}$

it may be inferred that the sum of squares of the at-most-3t leftover tasks is at least (¾·pd₀/α)²/(3t), which exceeds p²d₀²/(6tα²).
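The arithmetic behind this last step can be written out explicitly. Applying (18) with N=3t to leftover batches whose total length is at least (¾)pd₀/α gives:

$\frac{1}{3t}\left( \frac{3}{4} \cdot \frac{pd_{0}}{\alpha} \right)^{2} = \frac{9p^{2}d_{0}^{2}}{48t\alpha^{2}} = \frac{3p^{2}d_{0}^{2}}{16t\alpha^{2}} \geq \frac{p^{2}d_{0}^{2}}{6t\alpha^{2}}, \quad \text{since } \frac{3}{16} \geq \frac{1}{6}.$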

To show that each batch receives a sufficient budget, let the “depth” of a node be the maximum number of arcs on a path from that node to a node of outdegree 0. The punctual nodes are the only nodes of outdegree 0. Induction on the depth of nodes may be used to prove, for any node B_(i) of depth at most Δ, that

$\left( {c_{i} + d_{i}} \right)^{2} \leq {\sum\limits_{j \in \; B_{p}}{v_{ij}b_{j}^{2}}}.$

For sinks, i.e., nodes of outdegree 0, the claim is apparent, since

$\begin{aligned}\left( c_{i} + d_{i} \right)^{2} &\leq \frac{1}{\beta^{2}}c_{i}^{2} && \text{by definition of punctuality} && (19) \\ &\leq \frac{1}{\beta^{2}}b_{i}^{2} && \text{by (2)} && (20) \\ &= v_{ii}b_{i}^{2} && \text{by definition of } v_{ii} && (21) \\ &= \sum\limits_{j \in B_{p}} v_{ij}b_{j}^{2} && \text{because } v_{ij} = 0 \text{ if } j \neq i. && (22)\end{aligned}$

Take a tardy batch B₀ of depth Δ whose immediate children are B₁, . . . , B_(k). For any child B_(i) of B₀, whose depth has to be less than Δ, there is the following:

$\begin{matrix}{b_{i}^{2} \leq \left( {c_{i} + d_{i}} \right)^{2}} & (23) \\{\leq {\sum\limits_{j \in \; B_{p}}{v_{ij}b_{j}^{2}}}} & (24)\end{matrix}$

Now we prove the inductive assertion as follows.

$\begin{aligned}\left( c_{0} + d_{0} \right)^{2} &\leq \left( \frac{1}{1 - \beta} \right)^{2}d_{0}^{2} && \text{by the definition of tardiness} && (25) \\ &= \gamma\frac{p^{2}d_{0}^{2}}{6t\alpha^{2}} && \text{by the choice of } \gamma \text{ in (14)} && (26) \\ &\leq \gamma\sum\limits_{j^{\prime} = 1}^{k}\left( w_{0j^{\prime}}b_{j^{\prime}} \right)^{2} && \text{from above} && (27) \\ &\leq \sum\limits_{j^{\prime} = 1}^{k}\gamma w_{0j^{\prime}}^{2}\sum\limits_{j \in B_{p}} v_{j^{\prime}j}b_{j}^{2} && \text{by (23) and (24) above.} && (28)\end{aligned}$

Therefore:

$\begin{matrix}{\leq {\sum\limits_{j \in \; B_{p}}{v_{0j}b_{j}^{2}}}} & (29)\end{matrix}$from (15) above, and because for jεB_(p), the first arc of the paths canbe factored out to get

$\begin{matrix}{v_{0j} = {\sum\limits_{j^{\prime} = 1}^{k}{\left( {\gamma w_{0j^{\prime}}^{2}} \right)v_{j^{\prime}j}}}.} & (30)\end{matrix}$

More precisely, let P_(e,j), for an arc e=(u, v) and a node j, be the set of all directed paths from u to j whose second node is v. Then,

$\begin{matrix}{v_{0j} = {\frac{1}{\beta^{2}}{\sum\limits_{p \in P_{0j}}{\prod\limits_{e \in p}\left( {\gamma w_{e}^{2}} \right)}}}\quad\text{by (15)}} & (31) \\{= {\sum\limits_{j^{\prime} = 1}^{k}{\frac{1}{\beta^{2}}{\sum\limits_{p \in P_{e_{j^{\prime}},j}}{\prod\limits_{e \in p}\left( {\gamma w_{e}^{2}} \right)}}}}\quad\text{since } P_{0j} = {\bigcup\limits_{j^{\prime} = 1}^{k}P_{e_{j^{\prime}},j}}\text{ and } {P_{e,j} \cap P_{e^{\prime},j}} = \varnothing\text{ for } e \neq e^{\prime}} & (32) \\{= {\sum\limits_{j^{\prime} = 1}^{k}{\gamma w_{e_{j^{\prime}}}^{2}\frac{1}{\beta^{2}}{\sum\limits_{p \in P_{j^{\prime}j}}{\prod\limits_{e \in p}\left( {\gamma w_{e}^{2}} \right)}}}}\quad\text{and}} & (33) \\{= {\sum\limits_{j^{\prime} = 1}^{k}{\left( {\gamma w_{0j^{\prime}}^{2}} \right)v_{j^{\prime}j}}}\quad\text{by (15).}} & (34)\end{matrix}$

The second property of a charging scheme says that the budget available to a batch should not be overused.

For any batch B_(j),

${{\sum\limits_{i \in B}v_{ij}} \leq \lambda}:=\frac{1}{\beta^{2}\left( {1 - {t\;\gamma}} \right)}$

and tγ<1.

The delay intervals corresponding to batches of a single table may be disjoint, as shown: The delay interval of a batch B_(i)∈B starts at the end of J_(i). Suppose this interval intersects that of a batch B_(j), j≠i, from the same table. Without loss of generality, assume that J_(j) starts at least as late as J_(i). Thus, as the two intervals intersect, J_(j) should have been released before the delay interval of B_(i) ends. This is in contradiction with the definition of batching, as it implies J_(j) should be included in batch B_(i).

To demonstrate the second property of the charging scheme, let the height of a node be the maximum number of arcs on a path from any node to that node. Induction on the height of nodes can be used to demonstrate for any node B_(j) of height H,

${\sum\limits_{i \in B}v_{ij}} \leq \lambda$. It may be noted that:

$\begin{matrix}{{{t\;\gamma} = {\frac{6t^{2}\alpha^{2}}{p^{2}}\left( \frac{1}{1 - \beta} \right)^{2}\mspace{34mu}{by}\mspace{14mu}(14)\mspace{14mu}{above}}},{and}} & (35) \\{{\leq \delta < {1\mspace{169mu}{by}\mspace{14mu}{definition}\mspace{14mu}{of}\mspace{14mu}\alpha}},{\delta\mspace{14mu}{{above}.}}} & (36)\end{matrix}$

For a batch B_(j) at height zero (a source, i.e., a node of indegree 0), the definition of v_(ij), which involves a sum over all i→j paths, would be 0 unless i=j, in which case v_(ij)=1/β². Now the claim that λ≧1/β² follows from the definition of λ and the fact that tγ<1.

As above, the last arc of the path can be factored out, except for the zero-length trivial path. Consider B₀, whose immediate ancestors are B₁, . . . , B_(k) with arcs e₁=(B₁, B₀), . . . , e_(k)=(B_(k), B₀), respectively. These incoming arcs may come from batches corresponding to different tables. However, it may be shown that the sum Σ_(i=1)^(k)w_(e_(i)) of the weights of these arcs is at most t. More precisely, it may be shown that the contribution from any table is no more than one. Consider that w_(e_(i)) denotes the fraction of batch B₀ which is in the delay interval of batch B_(i). As the delay intervals of the batches of a single table are disjoint, as shown above, their total weight cannot be more than one and hence the total sum over all tables cannot exceed t.

Further, for any e, it may be shown that w_(e)<1. So

$\begin{matrix}{{\sum\limits_{i = 1}^{k}w_{ei}^{2}} \leq {t.}} & (37)\end{matrix}$

As the height of any ancestor B_(i) of B₀ is strictly less than H, the inductive hypothesis ensures that the total load Σ_(j∈B)v_(ji) on B_(i) is no more than λ. The total load on B₀ is:

$\begin{matrix}{{\sum\limits_{i \in B}v_{i\; 0}} = {\frac{1}{\beta^{2}} + {\sum\limits_{{i \in B},{i \neq 0}}{\sum\limits_{i^{\prime} = 1}^{k}{\gamma\; w_{i^{\prime}0}^{2}v_{{ii}^{\prime}}}}}}} & (38)\end{matrix}$

by definition of v_(ij) in (15) above, noting that v₀₀=1/β² and the fact that for any i≠0, any path from B_(i) to B₀ visits another batch B_(i′) which is an immediate ancestor of B₀,

$\begin{matrix}{{= {\frac{1}{\beta^{2}} + {\gamma{\sum\limits_{i^{\prime} = 1}^{k}{w_{i^{\prime}0}^{2}\left( {\sum\limits_{i \in B}v_{{ii}^{\prime}}} \right)}}}}},\mspace{14mu}{{by}\mspace{14mu}(38)}} & (39)\end{matrix}$ Now

${\sum\limits_{i \in B}v_{{ii}^{\prime}}} \leq \lambda$ by the inductive hypothesis applied to i′ and (39)

${\sum\limits_{i^{\prime} = 1}^{k}w_{i^{\prime},0}^{2}} \leq t$ by (37), then

$\begin{matrix}{{\sum\limits_{i \in B}v_{i0}} \leq {\frac{1}{\beta^{2}} + {\gamma t\lambda}}\quad\text{by (37) and the inductive hypothesis}} & (40) \\{= \lambda\quad\text{by the choice of }\lambda,} & (41)\end{matrix}$ as desired. The matrix (v_(ij)) is a charging scheme, as shown above.
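The inductive bound can be checked numerically on a small batch graph. The following is a minimal sketch only: the graph, the arc weights, and the helper name `total_loads` are hypothetical, and it assumes the recurrence read off from (38) and (39), namely that the load on a batch is 1/β² plus γ times the squared-weight sum of the loads on its immediate ancestors, with λ=1/(β²(1−tγ)).

```python
from typing import Dict, List, Tuple


def total_loads(
    parents: Dict[str, List[Tuple[str, float]]],
    order: List[str],
    beta: float,
    gamma: float,
) -> Dict[str, float]:
    """Return the total load (sum over i of v_ij) on every batch j.

    parents[j] lists (ancestor, arc_weight) pairs for the arcs entering j.
    order must be a topological order (ancestors before descendants).
    """
    load: Dict[str, float] = {}
    for j in order:
        # The trivial zero-length path contributes 1/beta^2; each incoming
        # arc contributes gamma * w^2 times the load already on its tail.
        load[j] = 1.0 / beta**2 + gamma * sum(
            w * w * load[i] for i, w in parents.get(j, [])
        )
    return load


# Hypothetical instance with t = 2 tables, so the squared arc weights
# entering any batch sum to at most 2.
t, beta, gamma = 2, 0.5, 0.25  # t * gamma = 0.5 < 1
lam = 1.0 / (beta**2 * (1.0 - t * gamma))
parents = {
    "B2": [("B0", 0.9), ("B1", 0.8)],
    "B3": [("B2", 0.95), ("B1", 0.5)],
}
loads = total_loads(parents, ["B0", "B1", "B2", "B3"], beta, gamma)
within_budget = all(v <= lam + 1e-12 for v in loads.values())
```

In this instance the sources carry load 1/β²=4 and every batch stays below λ=8, which is the behavior the second property of the charging scheme requires.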

As discussed above, the staleness measure lies, up to a factor of two, between the lower bound LOW and the penalty measure. Since the main theorem shows that these two outer values are close, staleness should also be close to the lower bound.

It can be argued that LOW is also a lower bound on staleness. Staleness is an integration of how out-of-date each table is. Tables can be considered separately. For each particular table, one can look at portions corresponding to different updates. If an update starts at time r and ends (i.e., is released) at r′, then at a point r≦x≦r′, the staleness is no less than x−r. Thus, the total integration is at least ½Σ_(i∈A)a_(i)². Staleness in most or all cases cannot be larger than half the penalty paid. For each specific table, the time frame is partitioned into intervals, marked by the times when a batch's performance is finished. The integration diagram for each of these updates consists of a trapezoid. Let y denote the staleness value at time r. The staleness at r′ is then y+r′−r. Total staleness for this update is

$\begin{matrix}{\rho^{*} = {\left( {r^{\prime} - r} \right)\frac{y + y + r^{\prime} - r}{2}}} & (42) \\{\leq \frac{\left( {r^{\prime} - r + y} \right)^{2}}{2}\quad\text{as } y \geq 0\text{ and } AB \leq \left( \frac{A + B}{2} \right)^{2}\text{ for } A,B \geq 0} & (43) \\{= \rho,} & (44)\end{matrix}$

where ρ is the penalty for this batch according to our objective.
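The trapezoid argument in (42) through (44) can be illustrated with a few sample values. This is a sketch only; the function names `update_staleness` and `update_penalty` and the sample triples (r, r′, y) are hypothetical.

```python
def update_staleness(r: float, r_end: float, y: float) -> float:
    """Area of the trapezoid in (42): staleness grows from y to y + (r_end - r)."""
    width = r_end - r
    return width * (y + y + width) / 2.0


def update_penalty(r: float, r_end: float, y: float) -> float:
    """Upper bound in (43): half the square of the staleness reached at r_end."""
    return (r_end - r + y) ** 2 / 2.0


# Hypothetical (r, r_end, y) triples.
samples = [(0.0, 4.0, 0.0), (2.0, 5.0, 1.5), (1.0, 1.5, 10.0)]
gaps = [update_penalty(*s) - update_staleness(*s) for s in samples]
```

Expanding the two expressions shows that the gap between them is y²/2, so the bound holds with equality exactly when y=0.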

There is no known online algorithm which is competitive with respect to stretch. To see this, suppose there is one processor and two tables. A large update of size S₁ arrives on the first table. At some point, it needs to be applied. As soon as its processing begins, a very small update of size S₂ appears on the second table. Since preemption is not allowed, the small update needs to wait for the large update to finish. The stretch would be at least αS₁/S₂. But if there was advance knowledge of this, the larger job could be delayed until completion of the smaller update.

Even if there is an offline periodic input, the situation might be less than optimal. In some cases, stretch can be large. Again, if there are two tables and one processor, one table may have a big periodic update of size S₁. The other table may have small periodic updates of size S₂. At some point, an update from table one should be performed. Some updates from table two may arrive during this time and need to wait for the large update to finish. So their stretch is at least α(S₁−S₂)/S₂.
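The single-processor example can be made concrete with a small calculation. The sketch below is illustrative and rests on stated assumptions: processing an update of size s takes α·s time, the small update is released just as the large one could start, and stretch is taken as completion delay divided by update size. The function name `small_update_stretch` is hypothetical.

```python
def small_update_stretch(alpha: float, s1: float, s2: float, delayed: bool) -> float:
    """Stretch of the small update on one non-preemptive processor.

    If delayed is False, the large update (size s1) is started first and the
    small update (size s2) must wait for it. If delayed is True, the large
    update is held back so the small update runs immediately.
    """
    wait = 0.0 if delayed else alpha * s1
    return (wait + alpha * s2) / s2


eager = small_update_stretch(alpha=0.5, s1=1000.0, s2=1.0, delayed=False)
clairvoyant = small_update_stretch(alpha=0.5, s1=1000.0, s2=1.0, delayed=True)
```

With these numbers the eager schedule gives the small update a stretch of at least αS₁/S₂=500, while the schedule with advance knowledge keeps it at α.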

The above examples all work with one processor, but similar constructions show that with p processors the stretch is not bounded and can be made as large as desired. To do this, p+1 tables and p processors are needed. The i^(th) table has a period which is much larger than that of the (i+1)^(th) one. The argument above was a special case for p=1.

A condition that allows stretch to be bounded is therefore sought. Suppose each table has updates of about the same length (i.e., it is semi-periodic). In other words, the updates from each table have size in [A, cA] for some constant c. Any constant c would work (yet give a different final bound), but c=2 is picked for ease of exposition. Further assume that tables can be divided into a few (g, to be precise) groups, such that the periods of updates in each group are about the same (the same condition as for the update lengths being in [A, 2A]). Then at least as many processors as groups are needed. Otherwise, there can be examples that produce arbitrarily large stretch values. Additionally, a reasonable assumption can be made that p is much larger than g. Each group is assigned some processors, in an amount proportional to its load. That is, the number of processors given to each group is proportional to the number of tables in the group. The algorithm is given in FIG. 5. After the assignment of processors to groups, each group runs a specific opportunistic algorithm. This algorithm, at each point when a processor becomes idle, picks the batch corresponding to the oldest update.
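The oldest-update rule used inside a group might be sketched as follows. This is not the algorithm of FIG. 5 itself; the function name `pick_next_table`, the data layout, and the assumption that a table with a running batch is skipped are all illustrative.

```python
from typing import Dict, List, Optional, Set


def pick_next_table(
    pending: Dict[str, List[float]],
    busy_tables: Set[str],
) -> Optional[str]:
    """Choose the table whose pending batch holds the oldest update.

    pending maps a table name to the release times of its accumulated,
    not yet applied updates. Tables that already have a batch running are
    skipped (assumed here: one batch per table at a time).
    """
    candidates = [
        (min(times), table)
        for table, times in pending.items()
        if times and table not in busy_tables
    ]
    if not candidates:
        return None
    # The smallest release time identifies the oldest waiting update.
    return min(candidates)[1]


# Hypothetical state when a processor becomes idle.
pending = {"A": [5.0, 7.0], "B": [3.0], "C": [1.0, 9.0]}
first = pick_next_table(pending, busy_tables={"C"})  # C is locked
second = pick_next_table(pending, busy_tables=set())
```

When an idle processor calls this rule, the whole pending batch of the chosen table would be applied at once, which matches the batching of accumulated update requests described earlier.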

At that point, each group forms an independent instance. Let us assume that for the t′ tables in a specific group with p′ processors, the upper bound is α≦p′/8t′ on each α_(i).

It can be shown that if all the updates of a group have sizes between A and 2A, and α≦p′/8t′, stretch is bounded by 3. Taking any one task, it can be shown it cannot wait for long, and thus its stretch is small. Note that stretch also considers the effect of batching this task with some other tasks of the same table. To this end, the execution of tasks is divided into several sections. Each section is either tight or loose. A tight section is a maximal time interval in which all the processors are busy. A loose section is any section that is not tight.

Jobs can be ordered according to their release times, and ties may be arbitrarily decided. Let ω_(i) denote the wait time (from release time to start of processing) of the i^(th) job (say, J_(i)). Let θ_(k) be the length of the k^(th) tight section (call it S_(k)). Recursive bounds can be established on values ω_(i) and θ_(k), and then induction may be used to prove they cannot be too large. There is an inter-relationship between them, but the dependence is not circular. Generally, θ_(k) depends on ω_(i) for jobs which are released before S_(k) starts. On the other hand, ω_(i) depends on ω_(i′) for i′<i and θ_(k) for S_(k) in which J_(i) is released.

Let i_(k) be the index of the last job released before S_(k) starts. The topological order of recursive dependence is then as follows: the ω_(i)'s are sorted according to i, and θ_(k) is placed between ω_(i_(k)) and ω_(1+i_(k)). The recursive formulas developed below relate the value of each variable to those to its left, and hence, circular dependence is avoided.

To derive a bound on θ_(k), one can look more closely at the batches processed inside S_(k). Let r and r′ be the start and end times of the section S_(k). These batches correspond to updates which are released before r′. Let the so-called “load” at time r be the total amount of updates (released or not) until time r that has not yet been processed. Part of a batch that is released might have been processed (although its effect would not have appeared in the system yet). After half of the processing time has passed, it can be considered that half of the batch has been processed. Updates which have not been released, but have a start time before r, may be considered to be part of the load (not all of it, but only the portion before r). The contribution X to the load by any single table is at most 2A+max_(i≦i_(k))ω_(i). There are at least three cases to consider. First, if no batch of the table is completely available at time r, the contribution is X≦2A; that is, there can only be one update which has not yet been released. Second, if a batch is waiting until time r, then X≦2A+max_(i≦i_(k))ω_(i), since the length of the batch is the actual contribution. Third, if a batch is running at time r, let z be the time at which its processing started. In most or all cases, z≦r and the processing of the batch continues up to at least time r. The load is

$\begin{matrix}{X \leq {\left( {r - z} \right) + \left( {{2A} + {\max\limits_{i \leq i_{k}}\omega_{i}}} \right) - {\left( {r - z} \right)/\alpha}}} & (45)\end{matrix}$ where the first term corresponds to a (possibly not yet released) batch which is being formed while the other batch is running, the second term bounds the length of the running batch and the last term takes out the amount of load that is processed during the period from z to r. Noting that α≦1, equation (45) gives X≦2A+max_(i≦i_(k))ω_(i).

Hence, the total load at time r is at most t(2A+max_(i≦i_(k))ω_(i)). Yet, there can be an additional load of tθ_(k) which corresponds to the updates inside S_(k). Because all p processors are busy throughout S_(k), comparing this load with the work done during S_(k) gives:

$\begin{matrix}{{{2A\; t} + {t\;{\max\limits_{i \leq i_{k}}\omega_{i}}} + {t\;\theta_{k}}} \geq {\theta_{k}{p/{\alpha.}}}} & (46)\end{matrix}$

Rearranging, and noting that p/α>t, yields:

$\begin{matrix}{\theta_{k} \leq \frac{\left( {{2A} + {\max\limits_{i \leq i_{k}}\omega_{i}}} \right)t}{\frac{p}{\alpha} - t}.} & (47)\end{matrix}$

Inequalities for ω_(i) may be considered. Without loss of generality, consideration is made of the waiting time for the first update of a batch. It may be noted that it has the largest wait time among all the updates from the same batch. If a task has to wait, it must be for one of two reasons: either all the processors are busy, or another batch corresponding to this table is currently running. Consider two cases:

The first case is when J_(i) is released in the loose section before S_(k). If ω_(i)>0, it is waiting for another batch from the same table. The length of the batch is at most max_(i′<i)ω_(i′)+2A. As soon as this batch is processed at time τ, processing of J_(i) may begin. Any job with higher priority than J_(i) should have been released before it is. When J_(i) is released, all these other higher-priority jobs are either running, or waiting for one batch of their own table. So, their count cannot be more than p−2 (there is one processor working on the table corresponding to J_(i) and one for each of these higher-priority jobs, and at least one processor is idle). In other words, there are at most p−2 tables which might have higher priority than J_(i)'s table at time τ. Thus, J_(i) cannot be further blocked at time τ. Hence,

$\begin{matrix}{\omega_{i} \leq {{\alpha\left( {{\max\limits_{i^{\prime} < \; i}\omega_{i^{\prime}}} + {2A}} \right)}.}} & (48)\end{matrix}$

The other case is when J_(i) is released inside the tight section S_(k). If J_(i) is processed after ω>θ_(k) time units pass since the start of S_(k), a batch from the same table has to be under processing at the moment S_(k) finishes; otherwise, J_(i) would start at that point. However, processing of this batch must have started before J_(i) was released; or else, J_(i) has to be part of it. Moreover, similarly to the argument for the first case, it can be shown that as soon as the processing of the blocking batch from the same table is done, the batch corresponding to job J_(i) will start to be processed. More precisely, there can be at most p−1 other batches with higher priorities than J_(i). So they cannot block J_(i) at time τ when the lock on its table is released. Hence,

$\begin{matrix}{{\omega_{i} \leq {\max\left\{ {\theta_{k},{\alpha\left( {{\max\limits_{i^{\prime} < \; i}\omega_{i^{\prime}}} + {2A}} \right)}} \right\}}},} & (49)\end{matrix}$ since it either waits for the tight section to end, or for a batch of its own table whose length cannot be more than max_(i′<i)ω_(i′)+2A.

One can pick ω*=θ*=A/3 and use induction to show that ∀i: ω_(i)≦ω* and ∀k: θ_(k)≦θ*, in part because α≦⅛ and p/α≧8t. Note that the right-hand sides of Equations (47), (48) and (49) would be no more than θ*=ω* if one substitutes these values for the ω_(i) and θ_(k) values in the formulas. It only remains to observe that the dependence is indeed acyclic, which is clear by the ordering and by the fact that each formula uses the values to its left in the ordering.

The length of a batch is at most A′+A/3≦(7/3)A, where the first term comes from the length of the first job in the batch (A≦A′≦2A), and the second term comes from the bound on its wait time. The resulting maximum stretch for any update piece would be bounded by (7/3)(1+α)<3.
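The substitution step of the induction, and the resulting stretch bound, can be verified with exact arithmetic. The sketch below is illustrative; the function names `bounds_hold` and `stretch_bound` and the sample group (four tables, four processors) are hypothetical, and the right-hand sides are taken from (47), (48) and (49) with ω*=θ*=A/3 substituted for every earlier ω and θ.

```python
from fractions import Fraction as F


def bounds_hold(A: F, alpha: F, p: int, t: int) -> bool:
    """Check that substituting omega* = theta* = A/3 keeps (47)-(49) within A/3."""
    star = A / 3
    theta_rhs = (2 * A + star) * t / (F(p) / alpha - t)  # right side of (47)
    omega_loose = alpha * (star + 2 * A)                 # right side of (48)
    omega_tight = max(star, omega_loose)                 # right side of (49)
    return theta_rhs <= star and omega_loose <= star and omega_tight <= star


def stretch_bound(alpha: F) -> F:
    """Stretch bound for a batch of length at most 7A/3: (7/3)(1 + alpha)."""
    return F(7, 3) * (1 + alpha)


# Hypothetical group: t' = 4 tables, p' = 4 processors, alpha = p'/(8 t') = 1/8.
ok = bounds_hold(F(1), F(1, 8), p=4, t=4)
worst = stretch_bound(F(1, 8))
too_slow = bounds_hold(F(1), F(1, 4), p=4, t=4)
```

At α=1/8 the bound in (47) is met with equality (both sides equal A/3), and the stretch bound evaluates to 21/8, below 3. Doubling α to 1/4 breaks (47) in this instance, which suggests why the condition on α is needed.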

The algorithm preferably keeps the stretch low. The algorithm in FIG. 5 can keep the stretch below 3 if

$\alpha \leq {\frac{p - g}{8\; t}.}$ As the sum of (p−g)|T_(i)|/t over the different groups is p−g, and rounding up adds less than one processor per group, Σ_(i)⌈(p−g)|T_(i)|/t⌉≦p. Thus, one would use, at most, as many processors as were available. Then, in each group, p′≧(p−g)t′/t. So the following applies: α≦p′/8t′. The arguments above, for the case in which all of the updates of a group have sizes between A and 2A, apply to show that stretch is bounded by 3.
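The processor assignment can be sketched as follows, assuming that each group's proportional share (p−g)|T_(i)|/t is rounded up to a whole number of processors. The function name `allocate_processors`, the group names, and the sizes are hypothetical; this is not presented as the exact procedure of FIG. 5.

```python
from typing import Dict


def allocate_processors(group_sizes: Dict[str, int], p: int) -> Dict[str, int]:
    """Give each group ceil((p - g) * |T_i| / t) processors.

    group_sizes maps a group name to its number of tables |T_i|;
    g is the number of groups and t the total number of tables.
    """
    g = len(group_sizes)
    t = sum(group_sizes.values())
    if p <= g:
        raise ValueError("assumes more processors than groups")
    spare = p - g
    # Integer ceiling division avoids floating-point rounding surprises.
    return {
        name: (spare * size + t - 1) // t
        for name, size in group_sizes.items()
    }


# Hypothetical warehouse: 10 tables in 3 groups, 12 processors.
sizes = {"fast": 6, "medium": 3, "slow": 1}
alloc = allocate_processors(sizes, p=12)
used = sum(alloc.values())
```

Because each ceiling adds less than one processor, the total never exceeds (p−g)+g=p, and each group receives at least (p−g)t′/t processors, which is the inequality used to recover α≦p′/8t′.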

So-called “weighted staleness” can also be considered. That is, each table has a weight w_(i) that should be multiplied by the overall staleness of that table. This takes into account the priority of different tables. In an online setting, no known algorithm can be competitive with respect to weighted staleness. As soon as a job from the low priority table is scheduled, a job appears from the high priority table which will then cost too much.

In the semi-periodic instance, weighted staleness of the algorithm in FIG. 5 is no more than nine times that of OPT. If the w_(i) are the weights in the weighted staleness definition, Σ_(i∈A)w_(i)a_(i)² is a lower bound on the weighted staleness. As stretch is less than 3, the weighted staleness cannot be larger than 3²=9 times that of the optimum.
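The weighted lower bound and the factor of nine can be illustrated with a short calculation. This is a sketch under assumptions: each update is represented by a hypothetical pair (weight of its table, update length a_(i)), and the function name `weighted_lower_bound` is not from the disclosure.

```python
from typing import Iterable, Tuple


def weighted_lower_bound(updates: Iterable[Tuple[float, float]]) -> float:
    """Sum of w_i * a_i**2 over updates given as (table_weight, update_length)."""
    return sum(w * a * a for w, a in updates)


# Hypothetical workload: two updates on a high-priority table (weight 3)
# and one on a low-priority table (weight 1).
updates = [(3.0, 2.0), (1.0, 4.0), (3.0, 1.0)]
low = weighted_lower_bound(updates)  # 3*4 + 1*16 + 3*1
ceiling = 9 * low                    # the 3**2 factor from the stretch bound
```

Any schedule whose stretch stays below 3 would then have weighted staleness between `low` and `ceiling` for this workload, per the argument above.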

To the maximum extent allowed by law, the scope of the present disclosure is to be determined by the broadest permissible interpretation of the following claims and their equivalents, and shall not be restricted or limited to the specific embodiments described in the foregoing detailed description.

What is claimed is:
 1. A method of updating data tables stored in a data warehouse, the method comprising: storing, in memory, a plurality of data tables; detecting, by a processor, incoming data for updating the plurality of data tables; generating, by the processor, an update request associated with each data table in the plurality of data tables; determining a calculated staleness for a portion of the plurality of data tables; scheduling updates to the portion of the plurality of data tables based on the calculated staleness; determining a stretch value for each one of the data tables in the portion of the plurality of data tables, the stretch value indicative of a maximum ratio between a duration of time a corresponding one of the updates waits until processing is finished and a length of the corresponding one of the updates, wherein scheduling data table updates is based at least in part on the stretch value; and distributing the updates among a plurality of processors to minimize the calculated staleness; transforming the portion of the plurality of data tables to include a different portion of the incoming data based on the scheduling.
 2. The method of claim 1, further comprising determining a previous update of the portion of the plurality of data tables.
 3. The method of claim 1, further comprising determining the update request is non-preemptible.
 4. The method of claim 1, further comprising batching an accumulation of update requests.
 5. The method of claim 1, further comprising performing the updates in a real-time data.
 6. The method of claim 5, further comprising responsive to receiving a user request for the portion of the plurality of data tables.
 7. The method of claim 1, further comprising weighting a first portion of the plurality of data tables higher than a second portion of the plurality of data tables.
 8. The method of claim 1, further comprising determining the calculated staleness is a priority weighted staleness.
 9. The method of claim 8, further comprising multiplying a first data table staleness by a first weight and multiplying a second data table staleness by a second weight.
 10. The method of claim 1, further comprising appending new data to one of the plurality of data tables.
 11. The method of claim 1, further comprising performing the updates at variable intervals.
 12. A non-transitory computer readable medium storing computer instructions that when executed cause a processor to perform a method for managing a plurality of data tables in a data warehouse, the method comprising: maintaining the plurality of data tables in the data warehouse; receiving requests to update, with incoming data, a portion of the plurality of data tables; generating update requests corresponding to the requests to update; determining calculated stalenesses for individual data tables of the portion of the plurality of data tables; ranking the calculated stalenesses; scheduling updates to the portion of the plurality of data tables based on the calculated stalenesses; determining a stretch value for each one of the individual data tables in the portion of the plurality of data tables, the stretch value indicative of a maximum ratio between a duration of time a corresponding one of the updates waits until processing is finished and a length of the corresponding one of the updates, wherein scheduling data table updates is based at least in part on the stretch value; distributing the updates among a plurality of processors to minimize the calculated stalenesses; and transforming the portion of the plurality of data tables to include a different portion of the incoming data based on scheduling of the updates and the update requests.
 13. The non-transitory computer readable medium of claim 12, further comprising weighting a first portion of the plurality of data tables higher than a second portion of the plurality of data tables, and wherein the scheduling is at least in part responsive to a result of the weighting.
 14. The non-transitory computer readable medium of claim 12, further comprising appending the incoming data to the portion of the plurality of data tables.
 15. A server for managing a data warehouse, the server comprising: a processor; and a memory storing instructions that when executed cause the processor to perform operations, the operations comprising: receiving incoming data for updating a plurality of data tables; an interface for receiving incoming data for updating the plurality of data tables determining calculated stalenesses for a portion of the plurality of data tables; weighting a portion of the calculated stalenesses; scheduling updates to the portion of the plurality of data tables based on the calculated stalenesses; determining a stretch value for each one of the individual data tables in the portion of the plurality of data tables, the stretch value indicative of a maximum ratio between a duration of time a corresponding one of the updates waits until processing is finished and a length of the corresponding one of the updates, wherein scheduling data table updates is based at least in part on the stretch value; and distributing the updates among a plurality of processors to minimize the calculated stalenesses.
 16. The server of claim 9, wherein the operations further comprise batching the updates.